Assessment of potential transthyretin amyloid cardiomyopathy cases in the Brazilian public health system using a machine learning model

Objectives To identify and describe the profile of potential transthyretin cardiac amyloidosis (ATTR-CM) cases in the Brazilian public health system (SUS), using a predictive machine learning (ML) model. Methods This was a retrospective descriptive database study that aimed to estimate the frequency of potential ATTR-CM cases in the Brazilian public health system using a supervised ML model, from January 2015 to December 2021. To build the model, a list of ICD-10 codes and procedures potentially related to ATTR-CM was created based on a literature review and validated by experts. Results From 2015 to 2021, the ML model classified 262 hereditary ATTR-CM (hATTR-CM) and 1,581 wild-type ATTR-CM (wtATTR-CM) potential cases. Overall, the median age of hATTR-CM and wtATTR-CM patients was 66.8 and 59.9 years, respectively. The ICD-10 codes most frequently recorded for hATTR-CM and wtATTR-CM were related to heart failure and arrhythmias. Regarding the therapeutic itinerary, 13% of hATTR-CM and 5% of wtATTR-CM patients received treatment with tafamidis meglumine, while 0% of hATTR-CM and 29% of wtATTR-CM patients were referred for heart transplant. Conclusion Our findings may be useful to support the development of health guidelines and policies to improve diagnosis and treatment, and to cover the unmet medical needs of patients with ATTR-CM in Brazil.


Introduction
Amyloidoses are a group of protein misfolding disorders in which misfolded proteins form insoluble amyloid fibrils that deposit in the tissues, leading to organ damage and dysfunction [1]. Two types of amyloid account for 95% of cardiac amyloidosis: light-chain amyloid (AL), due to immunoglobulin light-chain deposition, and transthyretin (TTR) cardiac amyloidosis (ATTR-CM), which can be due to a hereditary mutation (hATTR) or wild-type transthyretin (wtATTR) [1]. Hereditary transthyretin cardiac amyloidosis (hATTR-CM) is caused by one of the known heritable (autosomal dominant) mutations in the TTR gene, while wtATTR-CM (also known as senile or senile systemic amyloid CM) is caused by age-related changes in the wild-type TTR [1]. ATTR-CM is an under-recognized cause of heart failure (HF) in older adults. HF is an important cardiovascular disease (CVD), and CVDs remain the leading cause of death worldwide. Although ATTR-CM is considered a rare cardiac disease, recent studies have shown a prevalence of up to 13% among patients hospitalized with HF with preserved ejection fraction (HFpEF) [2]; 16% among patients with aortic stenosis undergoing transcatheter valve replacement [2]; 7-8% among patients undergoing carpal tunnel release surgery [3]; and 17% among older adults with HFpEF in an autopsy series [4]. Data on the epidemiology of the disease in Brazil are scarce, especially in the public health setting.
The natural history of ATTR-CM includes progressive HF, complicated by arrhythmias and conduction system disease. The clinical course is more variable for those with hATTR than for those with wtATTR [1]. Owing to its low penetrance, the hereditary form of the disease usually manifests after the age of 47, with a median survival ranging from 2 to 6 years after diagnosis (depending on genotype) in untreated patients [1,5-7]. On the other hand, wtATTR-CM predominantly affects men >60 years of age, with a median survival ranging from 3.5 to 5 years after diagnosis in untreated patients (depending on the stage of the disease) [1,5-7]. Diagnostic delays, which remain common in the current treatment landscape, are associated with a particularly poor prognosis [8]. The diagnosis of ATTR-CM is challenging for several reasons, including the similarity of its symptoms to those of HF, a common disease, especially among older adults, and the unfamiliarity of clinicians with the disease and its appropriate diagnostic algorithm [1]. Misdiagnosis is common in ATTR-CM, contributing to diagnostic delays and risking both further disease progression and treatment with ineffective and potentially harmful therapies [9]. Management of cardiac amyloidosis is complex and specific to the type of amyloidosis that affects the patient. In Brazil, the only treatment approved for ATTR-CM is tafamidis, a TTR stabilizer that binds the thyroxine-binding sites of TTR with high affinity and selectivity, slowing the dissociation of TTR tetramers into monomers and thereby inhibiting aggregation.
Given the frequent misdiagnosis and significant morbidity of ATTR-CM, and the availability of treatment by TTR stabilization, it is essential to identify ATTR-CM patients who are potentially under-recognized. Machine learning (ML) models applied to CVD [10-12] and based on medical claims data for the prediction of diseases and phenotypes have been described in the medical literature with increasing frequency [13-17]. Therefore, this study aimed to identify and describe the profile of potential ATTR-CM patients treated in the Brazilian public health system (SUS) and registered in the inpatient and outpatient databases from DATASUS, using a predictive machine learning model.

Study design
This was a supervised machine learning study based on retrospective administrative outpatient and hospitalization databases. It aimed to classify the database records and to estimate the frequency of potential ATTR-CM cases in the Brazilian public health system. The analysis period was from January 1, 2015, to December 30, 2021. A supervised model uses training datasets that contain information on the desired output (label; true outcome), i.e., the model learns from labelled training data how to predict the desired outcome. In this study, the labelled data defined "reference ATTR-CM" cases and "not ATTR-CM" cases, based on criteria set by the investigators and validated by an expert panel. The model then classified the cases labelled as "not ATTR-CM" to estimate how many of them could be under-recognized ATTR-CM cases (ATTR-CM-like cases) (S1 Fig). The "reference" cases were those most likely to have a confirmed ATTR-CM diagnosis, while the "like" cases (those classified by the model as having diagnoses related to ATTR-CM) were those most likely to be under-recognized ATTR-CM cases.
• Potential "ATTR-CM-like" cases were those not defined as ATTR-CM in the first step of the ML approach (i.e., defined as "not ATTR-CM cases") but classified as ATTR-CM by the algorithm. For those patients, the criteria set for case classification were:
• For wtATTR-CM:
  • Patients with at least one claim with any cardiac-related ICD-10 code (Table 1) AND at least one claim of the secondary procedures AND at least one mandatory procedure claim (S1 Table).
  • Patients with at least one claim with wtATTR-related ICD-10 codes (Table 1) AND at least one mandatory OR secondary procedure (S1 Table).
• For hATTR-CM: patients with at least one claim with hATTR-related ICD-10 codes (Table 1) AND at least one mandatory OR one secondary procedure (S1 Table).
Once the potential "ATTR-CM-like" patients were filtered, the ML model evaluated these patients and classified them as "ATTR-CM-like" or not. The selection of ICD-10 codes related to amyloidosis was based on previously published articles using claims data [18,19].
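The filtering criteria listed above can be read as boolean rules over a per-patient summary of claims. The following is a minimal sketch, not the study's code: the `PatientClaims` structure and its field names are hypothetical, and treating the two wtATTR-CM criteria as alternatives (joined by OR) is an assumption, since the text lists them without stating how they combine.

```python
from dataclasses import dataclass


@dataclass
class PatientClaims:
    """Hypothetical per-patient summary of claims (field names are illustrative)."""
    has_cardiac_icd: bool = False      # any cardiac-related ICD-10 code (Table 1)
    has_wt_icd: bool = False           # any wtATTR-related ICD-10 code (Table 1)
    has_h_icd: bool = False            # any hATTR-related ICD-10 code (Table 1)
    has_mandatory_proc: bool = False   # any mandatory procedure claim (S1 Table)
    has_secondary_proc: bool = False   # any secondary procedure claim (S1 Table)


def is_wt_candidate(p: PatientClaims) -> bool:
    # Criterion 1: cardiac ICD-10 AND secondary procedure AND mandatory procedure.
    rule_1 = p.has_cardiac_icd and p.has_secondary_proc and p.has_mandatory_proc
    # Criterion 2: wtATTR ICD-10 AND (mandatory OR secondary procedure).
    rule_2 = p.has_wt_icd and (p.has_mandatory_proc or p.has_secondary_proc)
    # Assumption: either criterion is sufficient.
    return rule_1 or rule_2


def is_h_candidate(p: PatientClaims) -> bool:
    # hATTR ICD-10 AND (mandatory OR secondary procedure).
    return p.has_h_icd and (p.has_mandatory_proc or p.has_secondary_proc)
```

Patients passing these rules would only be candidates; in the study, the final "ATTR-CM-like" label comes from the ML model, not from the filter itself.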

Data sources and feature selection
In Brazil, SUS is the universal healthcare system, on which more than 75% (~150 million) of the population depend exclusively. The remaining 25% of the population have supplementary private health insurance plans and may therefore access SUS episodically [20]. This study was based on outpatient and inpatient administrative data from DATASUS, the Informatics Department of SUS, the body responsible for collecting, processing, and disseminating healthcare data in Brazil [21]. Therefore, our study includes data on procedures performed within SUS, which covers between 150 million people and the total population of Brazil (205 million inhabitants). Two datasets were considered: the Inpatient Information System (SIH [Sistema de Informações Hospitalares]) and the Outpatient Information System (SIA [Sistema de Informações Ambulatoriais]). SIH and SIA are administrative databases for reimbursement purposes and do not provide patient information at the level of detail of medical charts [22,23]. The contents of the databases are described in detail elsewhere [24] (S2 Fig). Due to their administrative nature, SIH and SIA do not contain clinical data (e.g., signs and symptoms). Thus, the cause of admission (as per the International Classification of Diseases (ICD) 10 code) and the procedures performed during the hospitalization were used as predictor variables. Additionally, data on patient age, state of residence, dates of hospitalization and outpatient visits, diagnosis at entry (ICD based), procedures prescribed and performed, and in-hospital length of stay (days) were also extracted.
To build the model, a preliminary list of ICD-10 codes potentially related to ATTR-CM (i.e., the ICD-10 codes most frequently recorded for or related to ATTR-CM) was created based on a literature review. Subsequently, this list was validated by a group of Brazilian ATTR-CM experts, considering local caveats and the codes most frequently used. The final list of ICD codes is presented in Table 1.
Different types of procedures were also selected based on the literature review [25,26]. Codes and reference names of procedures were collected from the Management System of Procedures, Medications and OPM of the Unified Health System (SIGTAP), which lists the standard procedures approved within SUS [27]. The list of selected procedures was also validated by ATTR-CM experts considering the local reality (S2 Table). The experts assessed the results of each model step to identify possible limitations (sources of bias in label construction and confounding variables).

Study population
The wtATTR-CM cohort was defined considering the wtATTR-CM disease profile and its occurrence in patients aged 50 years or older [1], as previously done in a similar study in the USA [25]. Therefore, the wtATTR-CM cohort included patients aged ≥ 50 years at the index date (date of the first procedure claim related to the ICD-10 codes selected as potentially associated with wtATTR-CM) with wtATTR-CM-related ICD-10 codes and any of the cardiac-related ICD-10 codes listed in Table 1 during the study period. The hATTR-CM cohort included patients aged ≥ 18 years at the index date with hATTR-CM-related ICD-10 codes during the study period.

Linkage methods
Publicly available Brazilian health information system databases do not share a standard key identifier that would allow patient-level observations to be followed across databases, as required by legislation to guarantee data privacy. However, some outpatient databases have a unique encrypted patient code (key identifier), which allows a probabilistic linkage approach with the hospitalization dataset.
Therefore, because of the lack of a patient match key, we performed a probabilistic record linkage to allow longitudinal assessment using SIH and SIA, following multiple steps with different combinations of patient information from both databases, such as date of birth, city, and ZIP code [29]. Before each step, data cleaning was performed to keep only good-quality claims for linkage. About 5% of all patient records were discarded from the analysis due to low-quality information. This approach nevertheless enables an assessment of each patient's longitudinal record and thus allows us to evaluate their journey across the system.
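To make the multi-step idea concrete, the sketch below links inpatient records to outpatient patient codes using progressively looser key combinations. It is a deliberately simplified, deterministic stand-in for the probabilistic linkage described above (no match weights or similarity scores), and every name in it (`link_records`, `patient_id`, `record_id`, `birth_date`, `city`, `zip_code`) is hypothetical rather than taken from DATASUS or from the study's code.

```python
def link_records(inpatient, outpatient, key_steps):
    """Link inpatient records to outpatient patient ids, one key step at a time.

    Each step uses a tuple of field names as the matching key. A record is
    linked only when its key points to exactly one outpatient patient id.
    """
    links = {}
    for fields in key_steps:
        # Index outpatient patient ids by the key used in this step.
        index = {}
        for rec in outpatient:
            key = tuple(rec.get(f) for f in fields)
            if None in key:
                continue  # data cleaning: skip records with incomplete keys
            index.setdefault(key, set()).add(rec["patient_id"])
        for rec in inpatient:
            if rec["record_id"] in links:
                continue  # already linked in an earlier, stricter step
            key = tuple(rec.get(f) for f in fields)
            candidates = index.get(key, set())
            if None not in key and len(candidates) == 1:
                links[rec["record_id"]] = next(iter(candidates))
    return links


# Hypothetical toy data.
outpatient = [
    {"patient_id": "A", "birth_date": "1950-01-01", "city": "X", "zip_code": "111"},
    {"patient_id": "B", "birth_date": "1950-01-01", "city": "Y", "zip_code": "222"},
]
inpatient = [
    {"record_id": 1, "birth_date": "1950-01-01", "city": "X", "zip_code": "111"},
    {"record_id": 2, "birth_date": "1950-01-01", "city": None, "zip_code": "222"},
    {"record_id": 3, "birth_date": "1960-01-01", "city": "X", "zip_code": "111"},
]
steps = [("birth_date", "city", "zip_code"), ("birth_date", "zip_code")]
links = link_records(inpatient, outpatient, steps)
```

Ordering the steps from strictest to loosest means a record is matched on the most informative key available, and ambiguous keys (more than one candidate patient) are left unlinked rather than guessed.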

Statistical methods
For the ML model approach, the selected data were separated into training (60% of patients), validation (20% of patients), and test (20% of patients) datasets (S3 Fig). During the training step, the model learned the patterns in the data. The validation step was then used to decide on the most suitable algorithm, and the test step was the final validation of the model. A supervised learning algorithm was fitted on the training set to learn the patterns of ATTR-CM and not ATTR-CM cases. We tested four different supervised algorithms (logistic regression, Support Vector Machine, XGBoost, and Random Forest) so that we could choose the one with the best performance. K-fold cross-validation was performed to obtain the best model parameters and to control overfitting. After the best model had been evaluated on the validation set, it was also evaluated on the test set to confirm that it was the best model. The machine learning model approach is represented in S3 Fig.
Data analysis was performed considering numerical and categorical variables. The continuous variables were described using measures of central tendency (mean, median) and spread, including the range, quartiles, absolute deviation, variance, and standard deviation, as applicable. The categorical variables were described as counts and percentages. Age was calculated as the difference between the date of birth and the date of the first ICD-10 code of interest reported (index date). Age was described as a continuous variable, including the mean, standard deviation, median, and interquartile range, and by age group (absolute number and proportion per category). The demographic variables were described as categorical variables, with absolute frequencies and percentages, as was the frequency of the selected ATTR-CM-related ICD-10 codes. The proportion of ATTR-CM-reference and ATTR-CM-like cases among potential ATTR-CM cases in SUS was described by ATTR-CM type (hereditary and wild-type) per year. The model performance metrics were also evaluated, considering accuracy, sensitivity, and specificity, as follows: 1) Accuracy: the proportion of correct classifications, i.e., the number of correct predictions divided by the total number of predictions across all classes; 2) Sensitivity: the proportion of true positives that are correctly identified by the model; 3) Specificity: the proportion of true negatives that are correctly identified by the model. The equations used are described in S4 Fig.
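As an illustration of the three performance metrics, the sketch below computes them from binary labels using the standard confusion-matrix definitions, which match the footnote of Table 3. The function name and the toy labels are hypothetical, and the equations actually used by the authors are those in S4 Fig, which is not reproduced here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = ATTR-CM)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

With imbalanced cohorts such as the ones in this study, accuracy is dominated by the majority class, which is why sensitivity and specificity are reported alongside it.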
Time of follow-up was calculated as the difference between the date of the first claim with an ICD-10 code of interest and the last date of patient information available in the database. The annual hospitalization rate was described as the number of ATTR-CM-related hospitalizations per 100,000 inhabitants for each study year. The therapeutic itinerary was presented as the number and proportion of patients with a record of tafamidis, heart transplant, or liver transplant during the study period. The resource utilization per patient was summarized as the mean (SD) and median (IQR) number of hospital admissions and outpatient visits per patient; and the resource utilization per patient per year (PPPY) was calculated as the median (95%CI) number of procedures divided by each patient's follow-up time in years, according to the formula:

\[ \mathrm{PPPY} = \frac{\text{number of procedures recorded for the patient}}{\text{patient's follow-up time (years)}} \]

That is, the model classified a total of 262 potential hATTR-CM patients (reference and like cases; Table 2). The prevalence of hATTR-CM among hATTR patients was 24.8%, considering the 213 reference patients and the 860 individuals in the initial hATTR cohort.
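The follow-up, per-patient-per-year (PPPY), and annual hospitalization rate calculations described in the statistical methods can be sketched as follows. This is an illustrative reading only: the function names, the 365.25-day year, and the toy patients are assumptions, not details given by the authors.

```python
from datetime import date
from statistics import median

DAYS_PER_YEAR = 365.25  # assumption: average year length used for conversion


def follow_up_years(first_claim: date, last_record: date) -> float:
    """Follow-up time from the first claim of interest to the last record."""
    return (last_record - first_claim).days / DAYS_PER_YEAR


def events_pppy(n_events: int, first_claim: date, last_record: date) -> float:
    """Number of events divided by the patient's follow-up time in years."""
    years = follow_up_years(first_claim, last_record)
    if years <= 0:
        raise ValueError("follow-up time must be positive")
    return n_events / years


def hospitalization_rate_per_100k(n_hospitalizations: int, population: int) -> float:
    """Hospitalizations per 100,000 inhabitants for one study year."""
    return 100_000 * n_hospitalizations / population


# Hypothetical patients: (number of procedure claims, first claim, last record).
patients = [
    (4, date(2016, 1, 1), date(2020, 1, 1)),
    (3, date(2019, 1, 1), date(2020, 7, 1)),
    (10, date(2015, 6, 1), date(2021, 6, 1)),
]
rates = [events_pppy(n, start, end) for n, start, end in patients]
cohort_median_pppy = median(rates)
```

Computing the rate per patient first and then taking the cohort median keeps patients with short follow-up from being diluted by those observed for the whole study period.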

Study dataset construction and classification of the ATTR-CM cases
Considering the construction of the wtATTR-CM cohort, 938,385 individuals were aged ≥ 50 years and had at least one claim with wtATTR-CM ICD-10 codes or at least one claim with cardiac-related ICD-10 codes, and were therefore included in the wtATTR-CM initial cohort. Of these, 203 were classified as reference-wtATTR-CM and 6,177 were classified as potential wtATTR-CM cases in the first step of the ML model (Fig 1). In the final step of the ML model, of the 6,177 cases classified as potential in the first step, 1,378 (21.6%) were classified as wtATTR-CM-like cases and 4,799 (75.22%) as non wtATTR-CM cases (Table 2). That is, the model classified a total of 1,581 potential wtATTR-CM patients (reference and like cases). The prevalence of wtATTR-CM cases was 21.6 cases per 100,000 patients ≥ 50 years old with cardiac-related ICD-10 codes, considering the 203 reference patients and the 938,385 individuals in the initial wtATTR-CM cohort. It is important to note that this prevalence is based on an initial cohort of patients with cardiac failure and related diseases, and not on the overall Brazilian population.
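The wtATTR-CM figures in this paragraph can be cross-checked with simple arithmetic. The short sketch below uses only the counts reported above; the variable names are illustrative, and the use of the 6,380 classified patients (203 reference plus 6,177 potential) as the denominator for the percentages is an inference from the reported values.

```python
# Counts reported for the wtATTR-CM cohort.
initial_cohort = 938_385       # patients aged >= 50 with the selected ICD-10 codes
reference_cases = 203          # reference-wtATTR-CM
first_step_potential = 6_177   # potential cases after the first step
like_cases = 1_378             # wtATTR-CM-like cases after the final step
non_cases = 4_799              # non wtATTR-CM cases after the final step

# Inferred denominator for the reported percentages.
classified_total = reference_cases + first_step_potential

total_potential = reference_cases + like_cases
pct_like = 100 * like_cases / classified_total
pct_non = 100 * non_cases / classified_total
prevalence_per_100k = 100_000 * reference_cases / initial_cohort

print(total_potential)                 # 1581
print(round(pct_like, 1))              # 21.6
print(round(pct_non, 2))               # 75.22
print(round(prevalence_per_100k, 1))   # 21.6
```

The like and non-like counts sum to the 6,177 first-step potential cases, and the percentages and prevalence agree with the values in the text.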

Machine learning model performance
The final validated model was applied to both the hATTR-CM and wtATTR-CM datasets. The model classified the ATTR-CM cases as reference, potential, or not ATTR-CM. For the hATTR-CM cohort, the final validated model predicted 95.35% of hATTR-CM cases and 75.47% of not hATTR-CM cases and had an accuracy of 84.35% (Table 3). For the wtATTR-CM cohort (n = 6,380), the final validated model predicted 84.62% of wtATTR-CM cases and 77.85% of not wtATTR-CM cases and had an accuracy of 78.06% (Table 3).

Demographic characteristics of ATTR-CM-reference and ATTR-CM-like patients
Overall, the median age of hATTR-CM patients was 66.8 years (interquartile range [IQR] 50.5-70.3). In the reference group, the median age was 66.8 (IQR 52.8-74.1) years, while in the hATTR-CM-like group it was 65.9 (IQR 42.2-71.0) years. Most patients were over 60 years old in all groups, but a higher proportion of patients aged 30 to 49 years was observed in the hATTR-CM-like group. Most patients were male overall (58.8%), as well as in the reference hATTR-CM (58.2%) and hATTR-CM-like (61.2%) groups (Table 4).
The overall median age of wtATTR-CM patients was 59.9 years (IQR 55.1-66.3). In the wtATTR-CM reference group, the median age was 65.9 (IQR 58.4-73.8) years, while in the wtATTR-CM-like group it was 59.2 (IQR 54.8-65.2) years, demonstrating an opportunity to properly diagnose these potential patients while they are at a less advanced age. Most patients were under 70 years old in all groups, and the wtATTR-CM group had the highest proportion of individuals from 50 to 59 years old (49.1%). Males were the majority overall (62.1%), as well as in the wtATTR-CM reference (58.6%) and wtATTR-CM-like (62.6%) groups (Table 4).

Annual ATTR-CM hospitalization rate
Higher hospitalization rates were observed in the hATTR-CM reference group compared with the hATTR-CM-like group. The years with the highest hospitalization rates were 2017 and 2018 for the hATTR-CM reference cohort, and 2017, 2019, and 2020 for the hATTR-CM-like cohort (Fig 4). The opposite was observed for wtATTR-CM-like patients, who had a higher rate of hospitalization throughout the study period compared with wtATTR-CM reference patients. The years with the highest hospitalization rates were 2017 and 2018 for the wtATTR-CM reference cohort, and 2015 and 2018 for the wtATTR-CM-like cohort (Fig 5).
Outpatient setting. In the outpatient setting, hATTR-CM-like patients had more outpatient visits than hATTR-CM reference patients. For the entire cohort (n = 262), 244 (93.1%) patients had at least one record of an outpatient visit related to hATTR. The proportion of patients with a record of outpatient visits was higher in the hATTR-CM-like group (98.0%) than in the hATTR-CM reference group (92.0%). The median number of outpatient visits per patient was 8.0 (IQR 2.8-20) for all patients. The median number of outpatient visits PPPY was 5.0 (IQR 2.0-8.0) for all patients. However, analysis by disease type revealed that hATTR-CM reference patients had fewer outpatient visits PPPY than hATTR-CM-like patients (5.0 [IQR 2.0-7.5] and 6.3 [IQR 1.0-8.7], respectively) (Table 6).
Outpatient setting. In the outpatient setting, outpatient visits seemed similar across disease types. For the entire cohort (n = 1,581), 1,344 (85.0%) patients had at least one record of an outpatient visit related to wtATTR-CM. The proportion of patients with a record of outpatient visits was higher in the wtATTR-CM reference group (87.2%) than in the wtATTR-CM-like group (84.7%). Overall, the median number of outpatient visits per patient was 2.0 (IQR 1.0-4.0) for all patients. The median number of outpatient visits PPPY was 1.0 (IQR 1.0-2.0) for all patients and across disease types (Table 6).
Therapeutic itinerary. For the entire cohort (n = 1,581), 80 (5.1%) patients had at least one claim for tafamidis meglumine, of which 6 (3.0%) were in the wtATTR-CM reference group and 74 (5.4%) were in the wtATTR-CM-like group. Overall, 28.9% of the wtATTR-CM cohort had a record of heart transplant. Analysis by disease type revealed that, of these, 1 (0.5%) was in the wtATTR-CM reference group and 456 (33.1%) were in the wtATTR-CM-like group. The median number of ATTR-CM-related treatment claims during the study period was 3.0 (IQR 2.0-6.0) for the entire cohort, with similar trends in the wtATTR-CM reference and wtATTR-CM-like groups. The median number of treatment claims PPPY was 1.5 (IQR 1.0-2.1) for all patients, varying from 1.4 (IQR 1.0-2.0) in the wtATTR-CM reference cohort to 1.7 (IQR 1.0-2.2) in the wtATTR-CM-like group (Table 6).
Table footnotes. hATTR-CM: hereditary transthyretin amyloid cardiomyopathy; wtATTR-CM: wild-type transthyretin amyloid cardiomyopathy; IQR: interquartile range; PPPY: per patient per year; SD: standard deviation.
1 Only inpatient or outpatient claims with the ICD-10 codes selected for the study (ATTR-CM-related or cardiac-related).
2 Any treatment claim with the ICD-10 codes selected for the study (claims restricted to the selected ICD-10 codes).
Procedure metrics. The patterns of procedure claims were very similar across disease types and for the entire cohort. The median number of procedure claims during the study period was 1.0 (IQR 1.0-2.0) and the median PPPY was 1.0 (IQR 1.0-1.0). Only claims for procedures defined for this study and related to the selected study ICD-10 codes were considered (Table 6).

Discussion
This study used a validated ML model to identify potential ATTR-CM cases in the Brazilian National Health System (SUS). The results made it possible to characterize the ATTR-CM patients demographically and to assess the proportion of ATTR-CM-reference and ATTR-CM-like cases among potential ATTR-CM cases. In addition, the study results showed the ICD-10 codes most frequently recorded for ATTR-CM-like cases in DATASUS, the annual hospitalization rate, the treatment patterns of ATTR-CM and ATTR-CM-like cases treated in SUS and, finally, the average HCRU of ATTR-CM-reference and ATTR-CM-like cases. To the best of our knowledge, this is the first time that an ML model has been used to assess potential ATTR-CM cases in the Brazilian national health system.
Overall, our final validated ML model performed well in classifying ATTR-CM cases in a retrospective analysis, in line with other predictive studies using ML models [13,17,25]. The accuracy was 78.06% for wtATTR-CM and 84.4% for hATTR-CM. We identified possible under-recognized ATTR-CM cases from the data available in DATASUS. According to the final classification of the ML model, 10.3% of hATTR-CM patients and 21.6% of wtATTR-CM patients (potential ATTR-CM cases) may have been under-recognized between 2015 and 2021. Delayed diagnosis and misdiagnosis of ATTR are often reported in the literature [25,30,31]. The reasons for diagnostic delay are multifactorial and include symptom overlap with other conditions, low disease awareness, the historical need for invasive diagnosis, and, until recently, the lack of a disease-modifying treatment [30]. A previous study also identified under-recognized ATTR-CM cases in 4 different databases from the USA using an ML model [25]. The authors highlighted the importance of ML models as a tool to support early diagnosis, which results in a better prognosis of the disease [25].
A smaller proportion of under-recognized cases was identified in the hATTR-CM cohort than in the wtATTR-CM cohort, probably because the hATTR-CM cohort was built from a more restricted population, that is, patients with hereditary amyloidosis ICD-10 codes. On the other hand, in the wtATTR-CM cohort there was a higher proportion of underdiagnosed patients. While only 203 patients were classified as reference by the ML model, 1,378 patients were classified as wtATTR-CM-like cases, which may be indicative of a higher prevalence of the disease among older adults than expected. Previous studies have demonstrated that the clinical overlap between wtATTR-CM and other heart failure aetiologies is high [25,30,32], which may explain our finding.
Diagnosing ATTR-CM can be difficult, mainly for the wild-type form, as cardiac symptoms are consistent with more common types of heart failure and the extra-cardiac manifestations are heterogeneous and nonspecific [30]. In one study conducted in Spain, 30% of the patients with ATTR-CM had previously been misdiagnosed with other cardiac diseases, such as hypertensive heart disease, hypertrophic cardiomyopathy, and ischemic heart disease [32]. In our study, the ICD-10 codes most related to ATTR-CM presented a profile similar to these findings. Some patients with hATTR-CM may have a mixed phenotype, with cardiac and neurological manifestations. In our study, the prevalence of cardiac manifestations among hATTR patients was 24.8%, which is in line with the literature [33]. Previous studies have demonstrated that the diagnostic delay is shorter in patients with a mixed phenotype than in patients with only cardiac manifestations [30,34]. Early diagnosis can result in a better disease prognosis and in adequate treatment, since a disease-modifying treatment for ATTR-CM is available in Brazil.
Regarding the geographic distribution of the patients, a higher concentration was observed in the South and Southeast regions in this study, probably because these are the Brazilian regions with the highest number of specialized hospitals and clinics [35,36]. Access to medical care in Brazil is widely influenced by the concentration of services in large urban centres [37]. The territorial extension of Brazil makes it even more important and challenging to provide a highly coordinated multi-layered healthcare system [37]. Therefore, a higher proportion of patients referred to more developed regions was expected.
Concerning the therapeutic itinerary, tafamidis meglumine is the only specific drug treatment available in SUS for ATTR-CM. In addition to tafamidis meglumine, heart transplantation is also available; however, due to its invasive nature, it is considered only in extreme cases [15]. Among the hATTR-CM patients, 13% received treatment with tafamidis meglumine and there was no record of heart transplant, while among wtATTR-CM patients only 5% were treated with tafamidis meglumine and 29% were referred for heart transplantation. These data once again call attention to the importance of early diagnosis for disease progression. wtATTR-CM patients, if diagnosed early, could have received the drug treatment and avoided heart transplantation [26,31]. It is important to note that, although hATTR-CM patients were expected to have heart and liver transplantation [38,39], no such transplants were identified in the study period. However, procedures related to the management of post-transplant patients and the use of tacrolimus, a drug used to prevent transplant organ rejection, appeared among the most common procedures, i.e., these patients may have had a transplant before 2015 and were receiving maintenance care during the study period. Another consideration is that the low number of heart and liver transplants in the study period could be related to the introduction of tafamidis meglumine in SUS in 2016, as previously demonstrated in a 20-year retrospective study of the Familial Amyloidosis Polyneuropathy World Transplant Registry [38].
The hospitalization rate and the resource utilization also provide evidence of unmet medical needs, especially for wild-type patients; although a formal comparison was not performed, the hospitalization rate was much higher in the ATTR-CM-like group than in the ATTR-CM-reference group. These patients may have more hospitalization records because of a lack of correct diagnosis and, consequently, a lack of proper disease management. A previous study demonstrated that the use of tafamidis meglumine was associated with a lower rate of hospitalization as well as a shorter length of stay per hospitalization among all treated patients, mainly when the treatment was initiated in patients at an early disease stage [40].
Moreover, reductions in hospitalization rates might occur due to several factors [41]. In 2020 and 2021, however, these factors were compounded by the effect of the pandemic caused by the novel coronavirus [41]. In Brazil, the entire healthcare system was impacted, not only by the demand for care of COVID-19 cases, but also by the isolation and social distancing measures that compromised people's access to healthcare services [41]. In this study, we identified a reduction in hospitalization rates for both the wtATTR-CM and hATTR-CM cohorts, probably related to the isolation and social distancing arising from the COVID-19 pandemic.
Our study has some limitations. The use of retrospective data from administrative sources did not allow us to explore in depth and properly assess potential label biases and confounding variables, since information such as clinical data (e.g., signs and symptoms) was not available. To mitigate this, an expert panel assessed each step of the model: label construction, validation, the classification performed, and the frequency results obtained.
Additionally, an intrinsic limitation of retrospective data is that they are often incomplete, and this study depended on the quality and completeness of the non-mandatory data available [42].
Due to the administrative nature of the databases used, little clinical information was available; therefore, the only specific predictive variables for identifying ATTR-CM-related cases in the model were ICD-10 codes and procedures performed. Another important limitation is that DATASUS uses the International Classification of Diseases (10th version), which does not have a code for ATTR-CM, so we conducted the study using a set of parameters validated by experts and key opinion leaders (including ICD-10 codes and clinical procedures performed) for predicting ATTR-CM cases. The absence of laboratory test results in the datasets was also a limitation for the label construction.
To reduce the probability of including patients with diseases other than ATTR-CM, we defined, along with the expert panel, very specific mandatory procedures for classifying ATTR-CM and ATTR-CM-like cases. This might have had an impact on the number of patients identified by the model for a few reasons.
Firstly, these are more expensive and specialized procedures. In the context of the Brazilian public health system, barriers to accessing these types of procedures are expected [43]. Therefore, a lower number of patients is likely to undergo these procedures, which might have been reflected in the model results.
Secondly, we believe that the suspicion of amyloidosis in patients with heart failure is likely to be restricted to large centres (e.g., teaching hospitals, specialized clinics), as many physicians may not be familiar with ATTR-CM management [1], which can result in fewer patients undergoing diagnostic confirmation procedures.
Blind spots in machine learning can reflect the worst societal biases, with a risk of unintended or unknown accuracies in minority subgroups, and there are concerns over the potential for amplifying biases present in the data collected, which might lead to discriminatory bias [44]. In this context, the assessment by the expert panel of all model steps might have contributed to mitigating this influence.

Conclusion
The outcomes of this study supported the identification of potential ATTR-CM cases in DATASUS using a validated ML model, reflecting the public health system in Brazil. In our study, we were able to characterize this population demographically and clinically (considering their ICD-10 codes and procedures performed), and to identify the HCRU related to ATTR-CM management. The use of ML as a tool to identify potential patients with underdiagnosed diseases can be a hallmark for the allocation of public health resources and for medical education strategies. In addition, our findings may be useful to support the development of health guidelines and policies to improve diagnosis and treatment, and to cover the unmet medical needs of patients with ATTR-CM in Brazil.

A total of 1,508,468 individuals with claims for the selected ICD-10 codes were identified in the database from 2015 to 2021. Of those, 2,107 (0.14%) were excluded due to blood cancer, end-stage renal disease, or cerebral amyloid angiopathy. Thus, 1,506,361 individuals were considered to start the construction of the hATTR-CM and wtATTR-CM cohorts (Fig 1). Of the 1,506,361 individuals in the initial cohort, 860 were aged ≥ 18 years and had at least one claim with hATTR ICD-10 codes, composing the hATTR-CM initial dataset. Of these, 477 hATTR-CM cases were identified from 2015 to 2021, of which 213 were classified as reference-hATTR-CM and 264 were classified as potential hATTR-CM cases in the first step of the ML model (Fig 1). Finally, among those cases classified as potential hATTR-CM cases in the first step of the algorithm, 49 (10.27%) were classified as hATTR-CM-like cases and 215 (45.07%) were classified by the ML model as non hATTR-CM cases (Table 2).

Table 3. Model performance in final test.
Accuracy: proportion of correct classifications that a trained machine learning model achieves, i.e., the number of correct predictions divided by the total number of predictions across all classes; Sensitivity: measures the proportion of true positives that are correctly identified by the model; Specificity: measures the proportion of true negatives that are correctly identified by the model.
https://doi.org/10.1371/journal.pone.0278738.t003